This report looks is an update to the analysis shown on 1/14/2022. Most steps are the same with tweaking due to different data and outlier procedure

at exploring the relationship between wastewater and cases. There are four components to this analysis.

  1. Removing putative outliers

  2. Binning analysis

  3. Smoothing signal

  4. Statistical analysis

This report does not present any final answers but presents some very convincing heuristics.

Files Used:

./../../data/processed/DHSCaseData-4_21_2022.csv

./../../data/processed/DHSWasteData-4_21_2022.csv

1 Data: The first look

The two data sets used in this analysis are the Madison case data sourced from the Wisconsin DHS and wastewater concentration data produced by the Wisconsin State Laboratory of Hygiene. This wastewater data has entries every couple of days from 15 September 2020 to 19 April 2022.

Date Site Cases SevenDayMACases N1 N2 GeoMeanN12 N1FlowPop
2020-09-15 Madison 65 149.00000 63618 32447 45433.614 539978557
2020-09-19 Madison 129 120.00000 11442 9003 10149.499 93104069
2020-09-22 Madison 65 87.42857 39145 8102 17808.784 358522536
2020-09-23 Madison 122 80.71429 30509 7419 15044.809 278353421
2020-09-24 Madison 90 84.28571 16743 4437 8619.089 153420304
2020-09-25 Madison 92 89.00000 28553 1210 5877.851 264186511
##    min(Date)  max(Date)
## 1 2020-09-15 2022-04-19

The case data has a strong weekend effect so for this section we look at a seven day smoothing of cases. The simple display of the data shows the core components of this story. First, wastewater data is noisy. And that there is a clear relationship between the two signals.

FirstImpressionDF <- FullDF%>%
  NoNa(SelectedIndVar,"Cases")#Removing NA

FirstImpression <- FirstImpressionDF%>%
  ggplot(aes(x=Date))+#Data depends on time
  geom_point(aes(y=(Cases), color="Cases",info=Cases),size = 1)+
  geom_line(aes(y=MinMaxFixing(!!sym(SelectedIndVar),Cases), 
                color=SelectedIndVar,
                info=!!sym(SelectedIndVar)))+#compares SelectedIndVar to Cases
  geom_line(aes(y=(SevenDayMACases), 
                color="Seven Day MA Cases",
                info=Cases))+
  labs(y="Reported cases")+
  ColorRule


ggplotly(FirstImpression)%>%
    add_lines(x=~Date, y=FirstImpressionDF[[SelectedIndVar]],
            yaxis="y2", data=FirstImpressionDF, showlegend=FALSE, inherit=FALSE) %>%
    layout(yaxis2 = SecondAxisFormat,
           legend=list(title=list(text=''),x = 1.15, y = 0.9),
           margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))

Figure 1.1: Wastewater concentration and daily Covid-19 case data for Madison. A seven day moving average of cases is used to reduce a day of the week effect.

#To remoive weekend effects we are looking at the 7 day smoothing of cases.

2 Removing potential outliers

Looking at the wastewater measurements we observe there were some points many times larger than adjacent values hinting at them being outliers. We used the adjacent 10 values on each side and marked points 2.5 standard deviations away from the group mean as outliers.

#default pass to IdentifyOutliers
#method="SD", align="center", n = 5, Bin = 21, Action = "Flag"

ErrorMarkedDF <- FullDF%>%#
    mutate(FlagedOutliers = IdentifyOutliers(!!sym(SelectedIndVar), Action = "Flag"),
           #Manual flagging that method misses due to boundary effect with binning
           FlagedOutliers = ifelse(Date == mdy("01/24/2022"),
                                   TRUE, FlagedOutliers),
           NoOutlierVar = ifelse(FlagedOutliers, NA, !!sym(SelectedIndVar)))

#Split N1 into outlier and non outlier for next ggplot
OutLierPlotDF <- ErrorMarkedDF%>%
  mutate(!!OutlierName := ifelse(FlagedOutliers,!!sym(SelectedIndVar),NA))%>%
           mutate(!!SelectedIndVar := NoOutlierVar)

OutLierPlotObject <- OutLierPlotDF%>%
  filter(!(is.na(!!sym(SelectedIndVar))&is.na(!!sym(OutlierName))))%>%
  ggplot(aes(x=Date))+#Data depends on time 
  geom_line(aes(y=!!sym(SelectedIndVar),
                color=SelectedIndVar, 
                info = !!sym(SelectedIndVar)))+#compares Var to Cases
  geom_point(aes(y=!!sym(OutlierName),
                 color=OutlierName,
                 info = !!sym(OutlierName)))+
  ColorRule

#mentioned hand picked list other choices
ggplotly(OutLierPlotObject,tooltip=c("info","Date"))%>%
    layout(yaxis = SecondAxisFormat,
           legend=list(title=list(text=''),x = 1.15, y = 0.9),
           margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))

Figure 2.1: Wastewater concentration for Madison with potential outliers marked. Using a rolling symmetrical bin of 21 days as a sample we use 2.5 standard deviations of the bin as a metric to reject extreme points. This process is ran multiple times to get a robust process to select outliers.

#Drop Var create Var filter 
UpdatedDF <- ErrorMarkedDF%>%
  select(-sym(SelectedIndVar))%>%
  rename(!!sym(SelectedIndVar) := NoOutlierVar)

3 Binning

To isolate this relationship we used a primitive binning relationship. We used non overlapping bins of 2 weeks and took the median of the data within that range. This reduces autocorrelation issues and masks potential noise in the data. We see a very strong trend slightly improved without the outliers.

#StartDate is Where the binning starts
#DaySmoothing is The size of the bins
#Lag is The offset between Cases and Var
BinnedVarName <- paste("Binned",SelectedIndVar)
Bining <- function(DF,StartDate=1,DaySmoothing=14,Lag=0){
  BinDF <- DF%>%
    select(Date, Cases, !!sym(SelectedIndVar))%>%
    mutate(MovedCases = data.table::shift(Cases, Lag),#moving  SLD lag days backwards
        Week=(as.numeric(Date)+StartDate)%/%DaySmoothing)%>%#putting variables into bins via integer division
    group_by(Week)%>%
    summarise("Binned Cases" := median(MovedCases, na.rm=TRUE), 
              !!BinnedVarName := (median((!!sym(SelectedIndVar)),
        na.rm=TRUE)), 
        Date = median(Date,na.rm = TRUE))#summarize data within bins.
  return(BinDF)
}

BinErrorRemovedDF <- Bining(UpdatedDF)
BinErrorKeptDF <- Bining(FullDF)

DiffrenceDF <- inner_join(BinErrorRemovedDF,BinErrorKeptDF,by=c("Week","Date"))%>%
  filter(!!paste0(BinnedVarName,".x") != !!paste0(BinnedVarName,".y"))

BinedCorrGraph <- ggplot()+
  geom_segment(aes(x = !!sym("Binned Cases.x"), 
                   y = !!sym(paste0(BinnedVarName,".x")), 
                   xend = !!sym("Binned Cases.y"),
                   yend = !!sym(paste0(BinnedVarName,".y"))),
                data = DiffrenceDF)+
  geom_point(aes(x = !!sym("Binned Cases"), 
                 y = !!sym(BinnedVarName),
                 color = "outliers not removed",
                 info = Date),
             size = 2, 
             data = BinErrorKeptDF,
             shape=15)+
  geom_point(aes(x = !!sym("Binned Cases"), 
                 y = !!sym(BinnedVarName),
                 color = "outliers removed",
                 info = Date),
             data = BinErrorRemovedDF)+
  ggtitle(paste0(BinnedVarName,", Cases removed potential outliers"))+
  #geom_abline(slope = 3000)+
  labs(x="Binned Cases",y=BinnedVarName)

ggplotly(BinedCorrGraph,tooltip=c("x","y","info"))%>%
    layout(legend=list(title=list(text=''),x = 1.15, y = 0.9),
           margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))

Figure 3.1: Binned wastewater concentration and daily cases for Madison. Red squares are the median value of the bins without removing the flagged outliers. Blue circles are the median value of the bins removing the flagged outliers. The Black line connects the red square and blue circle representing the same bin.

#cor(BinDF$BinnedN1, BinDF$BinnedCases, use="pairwise.complete.obs")
#summary(lm(BinnedCases~BinnedN1, data=BinDF))
OutputBinning <- data.frame(row.names=c("correlation"),
  WithOutliers = c(cor(BinErrorKeptDF[[BinnedVarName]], 
                       BinErrorKeptDF[["Binned Cases"]],
                       use="pairwise.complete.obs")),
  
  WithOutOutliers = c(cor(BinErrorRemovedDF[[BinnedVarName]],
                          BinErrorRemovedDF[["Binned Cases"]],
                          use="pairwise.complete.obs")))
formattable(OutputBinning)
WithOutliers WithOutOutliers
correlation 0.6794737 0.6137276

4 Data smoothing

The goal in this section is to smooth the data to get a similar effect without losing resolution.

4.1 Case smoothing

A key component to this is that the relationship between N1 and Case involves a gamma distribution modeling both the time between catching Covid-19 and getting a test and the concentration of the shedded particles. We found a gamma distribution with mean 11.73 days and a standard deviation of 7.68 gives good results and matches other research (Fernandez-Cassi et al. 2021).

Mean <- 11.73
StandardDeviation <- 7.68
Scale = StandardDeviation^2/Mean
Shape = Mean/Scale
SLDWidth <- 21

weights <- dgamma(1:SLDWidth, scale = Scale, shape = Shape)
par(mar=c(4,4,4,10))
plot(weights,  
        main=paste("Gamma Distribution with mean =",Mean, "days,and SD =",StandardDeviation), 
            ylab = "Weight", 
            xlab = "Lag")
gamma distribution used for shedding lag distribution

Figure 4.1: gamma distribution used for shedding lag distribution

SLDSmoothedDF <- SLDSmoothMod(UpdatedDF,SLDWidth)


SLDPlot = SLDSmoothedDF%>%
  #NoNa("SLDCases")%>%#same plot as earlier but with the SLD smoothing
  ggplot(aes(x=Date))+
  geom_line(aes(y=Cases, 
                color="Cases" , info = Cases),alpha=.2)+
  geom_line(aes(y=SevenDayMACases,
                color="Seven Day MA Cases" , info = SevenDayMACases),alpha=.4)+
  geom_line(aes(y=SLDCases, color="SLD Cases",info = SLDCases))+
  labs(y="Reported Cases")+
  ColorRule

ggplotly(SLDPlot,tooltip=c("info","Date"))%>%
  layout(legend=list(title=list(text=''),x = 1.15, y = 0.9),
         margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))

Figure 4.2: Madison Case data for Madison. SLD Cases is a weighted mean of cases using the gamma distribution as the weight distribution.

4.2 viral load smoothing

To get a good smoothing of the N1 measurement we employ loess smoothing. Loess smoothing takes a locally weighted sliding window using some number of points. we found the best smoothing when it uses data within approximately 2 weeks of both sides of the data. The displayed plot shows the visual power of this smoothing. We see in general that the smoothed N1 trails SLD. However loess is symmetric meaning that it can not be used in predictive modeling due to it using points from the future to smooth points.

#SLDSmoothedDF[["SLDCases"]] <- loessFit(y=(SLDSmoothedDF[["SevenDayMACases"]]), 
#                      x=SLDSmoothedDF$Date, #create loess fit of the data
#                      span=.05, #span of .2 seems to give the best result, not rigorously chosen
#                      iterations=2)$fitted#2 iterations remove some bad patterns
SLDSmoothedDF <- LoessSmoothMod(SLDSmoothedDF,SelectedIndVar, loessVar, SpanConstant)


SLDLoessGraphic <- SLDSmoothedDF%>%
  NoNa(loessVar,"SLDCases")%>%
  ggplot(aes(x=Date))+
  geom_line(aes(y=Cases, color="Cases" , info = Cases),alpha=.1)+
  geom_line(aes(y=MinMaxFixing(!!sym(SelectedIndVar),SLDCases),
                color=SelectedIndVar,
                info = !!sym(SelectedIndVar)),
            alpha=.2)+
  geom_line(aes(y=SevenDayMACases,
                color="Seven Day MA Cases" , 
                info = SevenDayMACases),
            alpha=.3)+
  geom_line(aes(y=MinMaxFixing(!!sym(loessVar),SLDCases,!!sym(SelectedIndVar)), 
                color=loessVar ,
                info = !!sym(loessVar)))+
  geom_line(aes(y=SLDCases,
                color="SLD Cases" ,
                info = SLDCases))+
  labs(y="Reported cases")+
  ColorRule


ggplotly(SLDLoessGraphic,tooltip=c("info","Date"))%>%
    add_lines(x=~Date, y=SLDSmoothedDF[[SelectedIndVar]],
            yaxis="y2", data=SLDSmoothedDF, showlegend=FALSE, inherit=FALSE) %>%
    layout(yaxis2 = SecondAxisFormat,
           legend=list(title=list(text=''),x = 1.15, y = 0.9),
           margin = list(l = 50, r = 75, b = 50, t = 50, pad = 4))

Figure 4.3: Loess smoothed N1 and SLD cases for Madison data. Using a Locally Weighted Scatterplot Smoothing process along with the previous figure SLD cases we get the most sophisticated relationship between the two signals discussed in this document.

5 Towards a formal analysis

Cross correlation and Granger Causality are key components to formalize this analysis. Cross correlation looks at the correlation at a range of time shifts and Granger analysis performs a test for predictive power.

CCFChar <- function(ccfObject){
  LargestC = max(ccfObject$acf)
  Lag = which.max(ccfObject$acf)-21
  return(c(LargestC,Lag))
}

ModelTesting <- function(DF,Var1,Var2){
  UsedDF <- DF%>%
    NoNa(Var1,Var2)#removing rows from before both series started
  
  
  Vec1 <- unname(unlist(UsedDF[Var1]))
  Vec2 <- unname(unlist(UsedDF[Var2]))

  CCFReport <- CCFChar(ccf(Vec1,Vec2,na.action=na.pass,plot = FALSE))

  VarPredCase <- grangertest(shift(Vec1,max(0,CCFReport[[1]])), Vec2, order = 1)$"Pr(>F)"[2]
  CasePredVar <- grangertest(Vec2, shift(Vec1,min(0,CCFReport[[1]])), order = 1)$"Pr(>F)"[2]
  return(round(c(CCFReport,CasePredVar,VarPredCase),4))
}

#ErrorRemovedDF
BaseLine <- ModelTesting(FullDF,SelectedIndVar,"Cases")
BaseLineSevenDay <- ModelTesting(FullDF,SelectedIndVar,"SevenDayMACases")
ErrorRemoved <- ModelTesting(UpdatedDF,SelectedIndVar,"SevenDayMACases")
SLDVar <- ModelTesting(SLDSmoothedDF,SelectedIndVar,"SLDCases")
SevenLoess <- ModelTesting(SLDSmoothedDF,loessVar,"SevenDayMACases")
SLDLoess <- ModelTesting(SLDSmoothedDF,loessVar,"SLDCases")


Output <- data.frame(row.names=c("Max Cross Correlation","Lag of largest Cross correlation","P-value WasteWater predicts Cases","P-value Cases predicts wastewater"),
  CasesvsVar = BaseLine,
  SevenDayMACasesvsVar = BaseLineSevenDay,
  ErrorRemoved = ErrorRemoved,
  SLDVar = SLDVar,
  SevenLoess = SevenLoess,
  SLDLoess = SLDLoess)

OutputRightPosition <- transpose(Output)
colnames(OutputRightPosition) <- rownames(Output)
rownames(OutputRightPosition) <- c(paste("Section 1: Cases vs" , SelectedIndVar),
                                   paste("Section 1: 7 Day MA Cases vs" , SelectedIndVar),
                                   paste("Section 2: Cases vs" , SelectedIndVar),
                                   paste(" Section 4.2: SLD Cases vs ",SelectedIndVar),
                                   paste("Section 4.3: 7 Day MA Cases vs Loess smoothing of ",SelectedIndVar),
                                   paste("Section 4.3: SLD Cases vs Loess smoothing of ",SelectedIndVar))

formattable(OutputRightPosition)
Max Cross Correlation Lag of largest Cross correlation P-value WasteWater predicts Cases P-value Cases predicts wastewater
Section 1: Cases vs N1 0.7151 19 0.0263 0.2010
Section 1: 7 Day MA Cases vs N1 0.4252 -7 0.0005 0.9016
Section 2: Cases vs N1 0.7300 -7 0.0519 0.0755
Section 4.2: SLD Cases vs N1 0.7463 -14 0.0052 0.0741
Section 4.3: 7 Day MA Cases vs Loess smoothing of N1 0.7122 -9 0.0000 0.0000
Section 4.3: SLD Cases vs Loess smoothing of N1 0.8036 -16 0.0000 0.0028
ccf(FullDF$N1,SLDSmoothedDF$Cases,na.action = na.pass)

ccf(SLDSmoothedDF$loessN1,SLDSmoothedDF$SevenDayMACases,na.action = na.pass)

ccf(SLDSmoothedDF$loessN1,SLDSmoothedDF$SLDCases,na.action = na.pass)

Fernandez-Cassi, Xavier, Andreas Scheidegger, Carola Bänziger, Federica Cariti, Alex Tuñas Corzon, Pravin Ganesanandamoorthy, Joseph C. Lemaitre, Christoph Ort, Timothy R. Julian, and Tamar Kohn. 2021. “Wastewater Monitoring Outperforms Case Numbers as a Tool to Track COVID-19 Incidence Dynamics When Test Positivity Rates Are High.” Water Research 200: 117252. https://doi.org/10.1016/j.watres.2021.117252.